World Data League 2022

Notebook Submission Template

This notebook is one of the mandatory deliverables when you submit your solution. Its structure follows the WDL evaluation criteria and it has dedicated cells where you should add information. Make sure your code is readable as it will be the only technical support the jury will have to evaluate your work. Make sure to list all the datasets used besides the ones provided.

🎯 Challenge

Predict Waste Production for its Reduction

👥 Authors

💻 Development

Start coding here! 🐱‍🏍

1. Import packages

2. Import data

We are importing data from the following sources:

  1. data.austintexas.gov - Waste Collection Diversion Daily Report. For more details, see here.
  2. Google Mobility data for Travis County, where the city of Austin is located, see here.

3. EDA

Summary of main conclusions per subsection:

  1. Austin waste DataFrame information: missing values and multiplicity:
  2. Generating time-series and checking the seasonality of the components:
  3. Evolution of the main waste components as a function of time as a percentage of total:
  4. What is the dropsite location of each type of waste?

Although we have not used this information as an input for the model, we think that knowing the main collection centers - together with the most meaningful routes (point 5 below) - makes it possible to design better waste management routes, minimizing the carbon emissions of the waste management trucks.

  5. Analyzing Dropsite and what are the paths that contribute the most to the total

1. DataFrame information: missing values, column multiplicity and largest waste contributors

There are some missing values in the load_weight column (see below); otherwise everything looks fine. Let us take a look at the multiplicity of the columns:

Check where the missing values of load weight come from:

The missing values in SWEEPING might be due to the fact that sweeping load weights are too small to be reported.
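The check above can be sketched as follows. This is a minimal illustration on toy data, not the original cell: `load_weight` and `SWEEPING` come from the text, while the `load_type` column name and the other category labels are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Austin waste table.
# `load_type` and the non-SWEEPING labels are assumed names.
waste = pd.DataFrame(
    {
        "load_type": ["SWEEPING", "SWEEPING", "GARBAGE", "RECYCLING"],
        "load_weight": [np.nan, np.nan, 12500.0, 8300.0],
    }
)

# Count and share of missing load weights per load type.
missing_by_type = (
    waste["load_weight"]
    .isna()
    .groupby(waste["load_type"])
    .agg(n_missing="sum", share_missing="mean")
    .sort_values("n_missing", ascending=False)
)
print(missing_by_type)
```

A load type whose `share_missing` is close to 1 is one whose weights are essentially never reported, which is what we suspect for sweeping.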

Transform report date into a Datetime object and sort the DataFrame per this column, adding a couple of assistance columns.
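A minimal sketch of this step on toy data. The column name `report_date`, the date format and the choice of `year` and `month` as helper columns are assumptions made for illustration.

```python
import pandas as pd

# Toy rows in arbitrary order; `report_date` is an assumed column name.
waste = pd.DataFrame(
    {
        "report_date": ["03/15/2019", "01/02/2012", "07/11/2021"],
        "load_weight": [1000.0, 2000.0, 1500.0],
    }
)

# Parse the strings into datetimes (the format is an assumption) and sort.
waste["report_date"] = pd.to_datetime(waste["report_date"], format="%m/%d/%Y")
waste = waste.sort_values("report_date").reset_index(drop=True)

# Helper columns used later for grouping.
waste["year"] = waste["report_date"].dt.year
waste["month"] = waste["report_date"].dt.month
print(waste)
```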

Total amount of trash generated between '2012-01-01' and '2021-12-31':

As we shall see below, garbage collection is the main source of waste in Austin. The second source of waste is single-stream recycling (all material to be recycled - paper, plastic, metal - is placed in a single container). This is a rather inefficient way of recycling because 1) the material needs to be separated by a third party, and 2) it often leads to items that cannot be recycled. The third one is yard trimming (which is not too surprising given Americans' predilection for large grassy yards).

In the past few years, organics has been gaining relevance.

2. Generating time-series and checking the seasonality of the components

July 2021 has a smaller total waste weight because data collection stopped on the 11th of July.
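The monthly aggregation, and the partial final month, can be sketched like this. It is a toy example with a constant daily weight, not the original data; the idea is simply to flag months where fewer days were observed than the calendar month contains.

```python
import pandas as pd

# Toy daily totals ending on 11 July 2021, as in the real data.
dates = pd.date_range("2021-05-01", "2021-07-11", freq="D")
daily = pd.Series(100.0, index=dates, name="load_weight")

# Monthly totals, indexed by the first day of each month.
monthly = daily.resample("MS").sum()

# A month is complete when every calendar day has an observation.
days_observed = daily.resample("MS").count()
is_complete = pd.Series(
    days_observed.to_numpy() == monthly.index.days_in_month.to_numpy(),
    index=monthly.index,
)
print(monthly[is_complete])
```

Dropping or flagging the incomplete month avoids reading the July 2021 dip as a real decrease.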

Some comments:

  1. Most often, the peak of waste collection occurs at the end of March, beginning of April. What is happening during these dates? Could it be the famous festival, South by Southwest?
  2. January has a local maximum - might be due to waste collection after NYE.
  3. The amount of waste decreases throughout Jan and Feb, reaches a max in March/April, then decreases throughout the rest of the year until we reach November/December. (People go on holidays, grass does not need to be cut?)

What about the seasonality of certain components?

As expected, yard trimming increases in the Spring months.

3. Evolution of the main waste components as a function of time as a percentage of total

Unsurprisingly, regular garbage accounts for more than 50% of all the trash in Austin, whereas recycling is only about 25%. Note that both components have remained relatively steady throughout the years - showing how much progress is still to be made in waste management - with yard trimming (i.e., cut grass) decreasing since 2017.
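The yearly shares can be computed along these lines. This is a sketch on made-up numbers; the column names `year`, `load_type` and `load_weight` and the category labels are assumptions.

```python
import pandas as pd

# Toy yearly weights per waste type (made-up values).
waste = pd.DataFrame(
    {
        "year": [2019, 2019, 2019, 2020, 2020, 2020],
        "load_type": ["GARBAGE", "RECYCLING", "YARD"] * 2,
        "load_weight": [60.0, 25.0, 15.0, 110.0, 50.0, 40.0],
    }
)

# Total weight per year and type, then each type as a percentage of the year.
yearly = waste.pivot_table(
    index="year", columns="load_type", values="load_weight",
    aggfunc="sum", fill_value=0.0,
)
share_pct = yearly.div(yearly.sum(axis=1), axis=0) * 100
print(share_pct.round(1))
```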

The percentage of recycled waste and garbage collections decreased in 2021 - is this because the big 6 above no longer account for most of the trash, or because there is a change in the components? The answers to these two questions are given below.

The increase in organics is due to a City of Austin program for curbside composting collection that started in 2017 (maybe 2017 itself was not registered in the database, or it was under a different name). In 2021, the program was expanded. Furthermore, yard trimmings are now collected as part of this, which explains why their volume reduces dramatically from Feb 2021 onwards:

The program collects food scraps, yard trimmings, food-soiled paper and natural fibers, and converts them into nutrient-rich compost. Because materials are processed in a commercial composting facility, extremely high temperatures are reached, allowing Austinites to compost items like meat, dairy, seafood and bones that typically cannot be composted in a backyard. This program is part of the City of Austin’s zero waste goal to divert 90 percent of materials from landfills by 2040.

Indeed, the 6 main components described above account for most of the waste, thereby generating an almost closed system, where a change in one of the components is offset by changes in one or several of the other components in the big 6.

In 2021, the decrease in yard trimming and garbage collections was basically due to the increase in organics.

4. What is the dropsite location of each type of waste?

Waste per destination in more detail as a function of time:

Conclusions from the charts above:

  1. Undifferentiated garbage goes basically to one place, the TDS Landfill;
  2. Recycled products go to 3 main locations - Balcones, TDS - MRF (which was the most important point in 2012, but then lost relevance to Balcones) and TDS Landfill (where I think they also recycle);
  3. Yard trimming goes mostly to Hornsby Bend, with Organics by Gosh playing a more relevant role recently;

5. Analyzing Dropsite and what are the paths that contribute the most to the total

Checking what are the paths that most contribute to the overall amount of garbage.

Interestingly, several of the most meaningful paths are labelled with PAM or PAW.
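One way to rank the paths by contribution is a cumulative share, sketched below on toy data. The `route_number` column name, the route labels and the 90% cutoff are all assumptions for illustration; only the PAM/PAW prefixes come from the data.

```python
import pandas as pd

# Toy loads per route (made-up labels and weights).
waste = pd.DataFrame(
    {
        "route_number": ["PAM01", "PAW02", "RT10", "PAM01", "RT11", "PAW02"],
        "load_weight": [400.0, 300.0, 100.0, 100.0, 50.0, 50.0],
    }
)

# Total weight per route, largest first.
by_route = (
    waste.groupby("route_number")["load_weight"].sum().sort_values(ascending=False)
)

# Cumulative share of the total; keep the routes inside the first 90%.
cumulative_share = by_route.cumsum() / by_route.sum()
top_routes = cumulative_share[cumulative_share <= 0.9].index.tolist()
print(top_routes)
```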

4. Waste time-series analysis

Summary of main conclusions per subsection:

  1. Displaying the time-series for the most meaningful waste components;
  2. Modeling the garbage collection component using Statsmodels UnobservedComponents:
     1. Apart from the lag 1 autoregressive coefficient, the coefficients are not statistically significant. It therefore seems like the model is overfitting the data.
     2. At the 5% significance level, we cannot reject the null that the residuals are homoskedastic.
     3. At the 5% significance level, we cannot reject the null that the residuals are normal. The distribution of residuals however appears to be bimodal and has a kurtosis significantly different from 3 - although with such a small number of points it is hard to be very confident in the tests.
     4. The null hypothesis that there is no serial correlation of the residuals (Ljung-Box Q statistic) is rejected at the ~1% significance level (the p-value is actually lower). This means that specifying a single lag for the autocorrelation is not sufficient; we would likely need more lags to soak up the autocorrelation. This can also be seen below when we study the ACF and PACF of the residuals.
     5. We have created a model that can generate a monthly forecast of the amount of waste. However, we stress that out-of-sample performance needs to be assessed. Furthermore, the lack of statistical significance of the coefficients is, in our view, troublesome.
  3. Modeling using Statsmodels SeasonalDecompose. This is not a fundamental model as above; instead, it uses moving averages to fit the data. Therefore, it does not provide a great deal of insight with respect to the data generating process. Nevertheless, it is useful as a benchmark and to understand some features of the model:

Finally, conclusions from comparing the two models:

1. Displaying the time-series for the most meaningful waste components

Yard trimmings disappeared because these started being accounted for in "Organics".

Let us plot organics plus yard trimming together - remember that organics starts in 2018:

We decided not to model the total organics component - even though it has an interesting seasonal pattern - because the classification of the data changed (see above), making it harder to forecast.

2. Modeling the garbage collection component using Statsmodels UnobservedComponents

Let us try to model a normalized version of the garbage time series:

There is a periodic structure to the time-series. Let us use autocorrelation and partial autocorrelation to determine the lags:

As expected, there is a 12-month seasonality in the data. There might also be a second autocorrelation effect with a period of 36 months, but it is not obvious to us what physical effect this might be.

Trying a statsmodels UnobservedComponents model:

Conclusions:

  1. Apart from the lag 1 autoregressive coefficient, the coefficients are not statistically significant. It therefore seems like the model is overfitting the data.
  2. The null hypothesis of no heteroskedasticity (i.e., that the residual variance is constant) has a p-value of about 0.39, so at the 5% significance level we cannot reject it.
  3. The null hypothesis of normality has a p-value of about 0.15, so it cannot be rejected at the 5% significance level either. However, the distribution of residuals appears to be bimodal and has a kurtosis significantly different from 3 - although with such a small number of points it is hard to be very confident in the tests.
  4. The null hypothesis that there is no serial correlation of the residuals (Ljung-Box Q statistic) is rejected at the ~1% significance level (the p-value is actually lower). This means that specifying a single lag for the autocorrelation is not sufficient; we would likely need more lags to soak up the autocorrelation. This can also be seen below when we study the ACF and PACF of the residuals.

The level is changing with time - I wonder if this is due to the business cycle (see below). As for the seasonal pattern, it looks like we are capturing the peaks and valleys, although the amplitude is lower than actual.

The model undershoots at the start of the sample and overshoots at the end of the sample.

Let us check the residuals correlation structure:

Including an autoregressive component removes the autocorrelation at lag 1 in the residuals and improves the overall fit of the model, but at the expense of statistically insignificant coefficients.

It seems like we are at least roughly capturing the evolution of the waste level. With this model, we can now generate forecasts:

Main conclusion: As is usual with time-series forecasting, the model has less variation (amplitude-wise) than the actual time series, but it does seem to capture the periodicity of the data.

An extra necessary step to validate the model would be to test it out-of-sample, by comparing its predictions with 2020 and 2021 (which we have not done).
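The out-of-sample check we did not carry out could look like the sketch below. It holds out everything from a chosen date onwards and scores a forecast by mean absolute error. The helper names `seasonal_naive` and `holdout_mae` are hypothetical, and the seasonal-naive forecast (repeat the last observed year) is only a placeholder benchmark, not the UnobservedComponents model.

```python
import numpy as np
import pandas as pd


def seasonal_naive(train, horizon, period=12):
    """Repeat the last `period` observations of `train` for `horizon` steps."""
    last_cycle = train.to_numpy()[-period:]
    repetitions = int(np.ceil(horizon / period))
    return np.tile(last_cycle, repetitions)[:horizon]


def holdout_mae(series, test_start, forecast_fn):
    """Fit on data before `test_start`, forecast the rest, return the MAE."""
    train = series[series.index < test_start]
    test = series[series.index >= test_start]
    predictions = forecast_fn(train, len(test))
    return float(np.mean(np.abs(test.to_numpy() - predictions)))


# Toy, perfectly periodic monthly series covering 2012-01 to 2021-06.
index = pd.date_range("2012-01-01", "2021-06-01", freq="MS")
t = np.arange(len(index))
series = pd.Series(10 + np.sin(2 * np.pi * t / 12), index=index)

mae = holdout_mae(series, pd.Timestamp("2020-01-01"), seasonal_naive)
print(f"Holdout MAE: {mae:.3g}")
```

Any forecasting function with the same `(train, horizon)` signature can be passed in, so the fitted model and the benchmark can be compared on the same 2020 and 2021 holdout.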

3. Modeling using Statsmodels SeasonalDecompose

Summary of the main conclusions:

Decomposing the time-series into trend, seasonality and residual using moving averages.

Conclusions:

The spring and summer months are those that generate the most trash (i.e., positive seasonality) - except for August (maybe people are on vacation). In December, there is an uptick probably due to Christmas, which continues in Jan (NYE?).

Conclusions from comparing the two models:

5. COVID effect - Google Mobility data

Summary of main conclusions:

The reduction in retail/services mobility occurs simultaneously with an increase in residential mobility and in the amount of waste compared to 2019. This can be explained by a change in consumption patterns.

Let us now analyze the mobility data and compare this with the increase in waste volume in 2020 to check our hypothesis that consumption changed from goods and services to mainly goods. This should lead to a general increase in waste throughout 2020.

Furthermore, since goods consumption shifted from buying in physical stores to online shopping, we expect the amount of recycled waste to have increased even more given the increase in waste generated from mailed packages (cardboard boxes, plastic bags, styrofoam protective packages, etc.).

The biggest reduction is in the retail/recreation sector.

It seems to agree with our hypothesis but humility is due since we only have one observation. The effect is similar (slightly more pronounced) for recycling.
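The comparison with 2019 can be expressed as a month-by-month percentage change, sketched below on made-up monthly totals. The column names are assumptions and the numbers are not taken from the data.

```python
import pandas as pd

# Toy monthly waste totals for the same months in 2019 and 2020 (made-up values).
monthly = pd.DataFrame(
    {
        "year": [2019, 2019, 2019, 2020, 2020, 2020],
        "month": [3, 4, 5, 3, 4, 5],
        "weight": [100.0, 110.0, 120.0, 100.0, 121.0, 150.0],
    }
)

# One row per month, one column per year.
by_year = monthly.pivot(index="month", columns="year", values="weight")

# Percentage change of 2020 relative to the same month of 2019.
yoy_pct = (by_year[2020] / by_year[2019] - 1) * 100
print(yoy_pct.round(1))
```

Plotting this change next to the mobility series for the same months shows whether the two move together, keeping in mind that a single year is a single observation.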

This large (negative) shift in mobility was due to restrictions imposed by the city of Austin, in particular the 17 March ordinance requiring the closure of bars and dining areas, the 20 March state social distancing / social gathering restrictions and the 24 March stay-at-home orders. On 28 March, the Parks and Recreation Department closed all park amenities, aside from water fountains and restrooms. For more details, see here.

6. Demographics analysis

Summary of main conclusions:

Waste per person decreased from 2012 until 2020 because the population grew faster than the waste volume. However, due to Covid (as we have shown before), the large increase in the amount of waste has caused the ratio to increase to levels above those of 2012.

Because we want to complete the year of 2021 with data, we will plug the values of 2020 in the missing values of 2021, adjusted for the level of 2021 waste.
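One possible reading of this adjustment is sketched below on toy numbers: the missing 2021 months take the 2020 value of the same month, scaled by the ratio of 2021 to 2020 waste over the months observed in both years. The scaling rule is an assumption about what "adjusted for the level" means, and the toy data ignores the fact that July 2021 is only partially observed.

```python
import numpy as np
import pandas as pd

# Toy monthly totals: 2020 complete, 2021 observed for the first 6 months only.
index = pd.date_range("2020-01-01", "2021-12-01", freq="MS")
values = np.concatenate(
    [[100.0] * 6, [120.0] * 6, [110.0] * 6, [np.nan] * 6]
)
monthly = pd.Series(values, index=index)

y2020 = monthly[monthly.index.year == 2020].to_numpy()
y2021 = monthly[monthly.index.year == 2021].to_numpy()
observed = ~np.isnan(y2021)

# Level adjustment: 2021 vs 2020 over the months observed in both years.
level_ratio = y2021[observed].sum() / y2020[observed].sum()

# Keep observed 2021 months, fill the rest with scaled 2020 values.
filled_2021 = np.where(observed, y2021, y2020 * level_ratio)
print(filled_2021)
```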

Waste per person decreased from 2012 until 2020 because the population grew faster than the waste volume. However, due to Covid (as we have shown before), the large increase in the amount of waste has caused the ratio to increase to levels above those of 2012.

This is a warning for policy makers that changes in consumption patterns can dramatically alter waste production and that proactive policy is needed in order to counter the negative effects of an increased amount of waste.

🖼️ Visualisations

Copy here the most important visualizations (graphs, charts, maps, images, etc). You can refer to them in the Executive Summary.

Technical note: If not all the visualisations are visible, you can still include them as an image or link - in this case please upload them to your own repository.

👓 References

List all of the external links (even if they are already linked above), such as external datasets, papers, blog posts, code repositories and any other materials.

All the websites below are from the city of Austin, Texas official webpage or its open data portal unless stated otherwise:

Waste collection & diversion report

Curbside composting collection

Expansion of the composting program

Yard trimming collection

Recycling

Demographics data

Single-stream recycling (Wiki)

Google Mobility data (Google)

TDS Landfill (Youtube)

OECD Composite Leading (Economic) Indicator

⏭️ Appendix

Add here any code, images or text that you still find relevant, but that was too long to include in the main report. This section is optional.

Ideas that we explored but did not come to fruition/deserve more investigation:

  1. The large holidays (X-mas, Thanksgiving, 4th of July) are local maxima. The city of Austin could implement targeted measures to drive more recycling rather than undifferentiated waste. This is especially relevant in the US where it is common to use disposable plates/forks&knives/cups in parties.

  2. Following the above idea, check if days where the University of Texas (UT) football team plays lead to an increased amount of waste. American football is a huge thing in Texas, especially at the college level, and tailgating in the vicinity of the stadium, or getting friends together to party during/after the game, is extremely common. Since the UT stadium is in the center of Austin, one is likely to see an increase in waste on the days on which UT plays - see here for their 2021 playing schedule. Since the 2021 season was delayed due to COVID, it is probably better to look at the 2019 season. However, we did not find any meaningful increase of waste in the data (see image below - the white square-like shape just above Austin at the center of the image), partially because the area of the stadium does not appear in the waste data. If we could show that there is an increase in waste, we could target UT's management and ask them to enforce recycling policies - such as providing refunds for disposable plates/cups, having UT football legends advertise this issue, etc.

  3. The city of Austin states on its website: "The City of Austin is committed to a zero waste goal to reduce the amount of trash sent to landfills by 90% by the year 2040." We tried to verify this claim (because it seems unlikely to be feasible given the lack of reduction in garbage collections), but were not able to.

  4. Waste hotspots (see below). There are areas of the city that produce more waste than others. With this information, policy makers could try to design targeted community programs to reduce waste.

newplot.png